Abstract

This study investigates the spatial distribution of assault incidents in Toronto in 2023, with a focus on the potential clustering of assaults near Toronto Transit Commission (TTC) subway routes. Using assault occurrence data from the Toronto Police Service and TTC subway route data from the City of Toronto, we conduct a spatial analysis to determine whether proximity to subway infrastructure is associated with the intensity and clustering of assault incidents.

In this study, we create buffers of different sizes around the subway routes and classify incidents as occurring near or away from the routes accordingly. In other words, we define “near TTC subway routes” using several buffer sizes. Preliminary findings indicate that 42.99% of assaults occurred within the 1 km buffer, with higher densities observed near Line 1 and Line 2 than near Lines 3 and 4. The study employs point pattern analysis and spatial modelling, including tests of complete spatial randomness, kernel density estimation, point process modelling, and cluster detection through HDBSCAN, to evaluate spatial dependence.

The spatial analysis of Toronto assault cases reveals significant clustering, rejecting the null hypothesis of complete spatial randomness (CSR) in all tests. Kolmogorov-Smirnov tests and Ripley’s K-function demonstrate significant clustering of assault cases in Toronto, and the degree of clustering depends on whether the assault occurred within the TTC subway route buffer (i.e. near the TTC subway). Comparing Ripley’s K-functions across the within-buffer subsets shows that clustering patterns also depend on the buffer size (i.e. on proximity to the TTC subway routes). HDBSCAN cluster analysis on the different data subsets shows that assault incidents near TTC subway routes remain concentrated within the 1-2 km buffers and retain their spatial patterns when the data are expanded to 5 km and to the entire city. The larger datasets reveal additional, smaller clusters, suggesting that TTC subway routes may influence the distribution of assault incidents.

Research findings can provide insights for urban planners and policymakers, informing strategies to enhance public safety around transit infrastructure. This research contributes to understanding the interplay between urban crime patterns and public transit systems, with implications for cities beyond Toronto.

1. Introduction

1.1 Background

Public safety is a central concern in urban planning, particularly in cities like Toronto where public transit systems, such as the Toronto Transit Commission (TTC), are heavily utilized. With millions of commuters relying on the subway system daily, it is crucial to ensure that these transit corridors are safe for all users. Assault cases occurring in and around transit hubs have raised concerns about whether the infrastructure and social environment surrounding subway lines contribute to crime. The subway lines, as major transit arteries, can influence the surrounding environment in various ways. High volumes of pedestrian traffic, densely populated areas, and varied socio-economic conditions around the stations might contribute to spatial patterns of crime that differ from other parts of the city.

1.2 Research question

This research seeks to investigate whether there is a significant clustering of assault cases occurring near TTC subway routes compared to other areas in Toronto. We aim to explore whether the spatial distribution of assault incidents follows random patterns across Toronto or shows a tendency to concentrate in proximity to TTC subway lines.

1.3 Importance of this study

The findings provide insights into how transit infrastructure may interact with urban crime patterns and offer recommendations for transit authorities and policymakers to improve safety measures. If significant clustering is detected, the results could influence the allocation of law enforcement resources, guiding increased security measures around certain subway stations. Urban planners could also use the findings to design safer transit environments, incorporating infrastructure and urban designs that discourage crime. Furthermore, this research could have broader implications for understanding the relationship between public transit systems and urban crime, contributing to policies aimed at enhancing public safety not just in Toronto, but in other urban centers with similar transit networks.

2. Data

2.1 Study domain

The study focuses on a spatial analysis of assault incidents in Toronto in 2023, specifically examining their distribution across transit zones (TTC subway buffer zones) to understand the clustering patterns of assaults in relation to transit infrastructure.

2.2 Datasets

We utilize two datasets:

2.21 Toronto Police Service Assault Occurrences

This dataset includes records of assault occurrences in Toronto reported to the Toronto Police Service since 2014. The data are point pattern data, with each observation giving the spatial location of an individual assault incident as coordinates (longitude and latitude) in the WGS84 datum. The full dataset consists of 206,956 observations. Aside from the spatial information about the incident, other attributes include the following:

  • unique event id
  • incident occurred date (year, month, day) and time
  • incident reported date (year, month, day) and time
  • police station division in which the incident occurred
  • type of assault offence (e.g. Assault With Weapon, Aggravated Assault, Discharge Firearm - Recklessly, etc.)
  • neighborhood in which the assault occurred
  • location type where the assault occurred (e.g. hospitals, restaurants, parking lots, etc.)
  • premises type where the assault occurred (e.g. apartment, transit, house, educational, outside, etc.)

For this study, we focus only on incidents that occurred in 2023. The reduced dataset includes 6,478 distinct locations where assaults occurred in 2023 and, in total, 23,639 individual assault incidents recorded for this year. Note that multiple assaults may have occurred at the same location. To answer our main research question, we require only the spatial location of each point observation, which denotes the occurrence of an assault incident.

2.22 TTC Subway Routes

This dataset, provided by the City of Toronto, contains geographic shapefiles representing the Toronto Transit Commission (TTC) subway lines. The spatial data are line data, representing each subway line’s path through geographic coordinates (longitude and latitude) in the WGS84 datum. This dataset consists of 4 observations, each representing one subway line within Toronto’s transit system. Aside from the spatial information about the route, other attributes include the following:

  • subway route name
  • subway route number

For this study, we consider all 4 observations in our study domain, since we are interested in understanding incident patterns near all subway lines.

Below is a map visualizing the assault incidents that occurred in Toronto in 2023 and the 4 TTC subway lines.

Figure 1: Map visualizing the assault incidents that occurred in Toronto in 2023 and the 4 TTC subway lines. Each red point represents a single assault incident. The yellow line represents the Line 1 subway route, the green line Line 2, the blue line Line 3, and the purple line Line 4. The grey polygon is the window representing the Toronto region.

2.3 Spatial computations

We first produce 1 km, 2 km and 5 km spatial buffers along the TTC subway routes. Regions inside a buffer are considered near TTC subway routes, and regions outside the buffer are considered not near TTC subway routes.

Then, we perform a spatial join to merge the assaults dataset with the buffer dataset by geometric intersection. Hence, assault incidents that occurred within a given buffer area are labelled “assault occurred near TTC subway area”, and assault incidents located outside that buffer area are labelled “assault occurred outside the TTC subway area”.

In other words, we create 3 additional binary covariates, each representing whether the assault incident occurred within the 1km, 2km and 5km spatial buffers respectively.
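The classification above can be illustrated with a minimal Python sketch. It is not the code used in this study (which relies on a spatial join in GIS software); it only shows the underlying idea that a point lies inside an r-metre buffer of a route exactly when its shortest distance to the route is at most r. The function names, the toy route and the assumption that coordinates have already been projected from WGS84 to a planar system in metres are all ours.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (planar coordinates)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_route(p, route):
    """Shortest distance from p to a polyline given as a list of vertices."""
    return min(point_segment_distance(p, a, b) for a, b in zip(route, route[1:]))

def buffer_flags(p, routes, radii_m=(1000, 2000, 5000)):
    """Binary 'within buffer' indicators for one point, one per buffer radius."""
    d = min(distance_to_route(p, route) for route in routes)
    return {radius: d <= radius for radius in radii_m}

# Hypothetical toy route (metres): a straight east-west line 10 km long.
toy_routes = [[(0.0, 0.0), (10_000.0, 0.0)]]
flags = buffer_flags((5_000.0, 1_500.0), toy_routes)
```

In this toy case the point lies 1.5 km from the route, so it falls outside the 1 km buffer but inside the 2 km and 5 km buffers.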

2.4 Basic summary statistics

2.41 Assaults incidents

In total, there were 23,639 individual assault incidents recorded in Toronto in 2023. Of these, 10,162 occurred within the 1 km TTC subway buffer, i.e. near TTC subway routes, accounting for around 42.99% of all assault incidents. The remaining 13,477 assault incidents occurred outside the 1 km buffer, accounting for around 57.01%. Hence, assault incidents are split fairly evenly between locations near and away from TTC subway routes. The table below shows the number and proportion of assault incidents that occurred within each buffer size.

Summary of assault cases that occurred within TTC subway route buffers

Buffer size    Number of assaults within buffer    Proportion of assaults within buffer (%)
1 km           10162                               42.99
2 km           14648                               61.97
5 km           21618                               91.45

Table 1: Summary of assault cases that occurred within TTC subway route buffers. This table gives the number and proportion of assault cases that occurred within buffers of different radii. As expected, the number and proportion of assault cases increase with the buffer size, with the 5 km buffer covering over 90% of cases.
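As a quick arithmetic check, the proportions in Table 1 follow directly from the counts. The short Python sketch below recomputes them; the variable and function names are ours and are used only for illustration.

```python
total_assaults = 23639  # all 2023 incidents (Section 2.41)
within_buffer = {"1km": 10162, "2km": 14648, "5km": 21618}  # Table 1 counts

def share_percent(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100.0 * count / total, 2)

shares = {size: share_percent(n, total_assaults) for size, n in within_buffer.items()}
outside_1km = total_assaults - within_buffer["1km"]
share_outside_1km = share_percent(outside_1km, total_assaults)
```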

2.42 Subway lines

TTC subway route buffer measurements

Subway route name            Length (km)    Area, 1 km buffer (km^2)    Area, 2 km buffer (km^2)    Area, 5 km buffer (km^2)
LINE 1 (YONGE-UNIVERSITY)    38.89          99.30                       189.44                      458.27
LINE 2 (BLOOR - DANFORTH)    26.19          73.43                       150.41                      429.16
LINE 3 (SCARBOROUGH)         6.62           20.24                       47.93                       175.20
LINE 4 (SHEPPARD)            5.37           17.56                       43.16                       166.54

Table 2: TTC subway route buffer measurements. This table lists the length and buffer areas of each of the 4 TTC subway lines.

Line 1 and Line 2 are significantly longer in length compared to Line 3 and Line 4. Accordingly, Line 1 and Line 2 buffers are also significantly larger in terms of area.

2.43 Assault cases summary by subway line and buffer size

Summary statistics of assault incidents by TTC subway routes

Subway route name            Assaults (1 km)    % (1 km)    Assaults (2 km)    % (2 km)    Assaults (5 km)    % (5 km)
LINE 1 (YONGE-UNIVERSITY)    6385               27.01       8704               36.82       13737              58.11
LINE 2 (BLOOR - DANFORTH)    4452               18.83       8505               35.98       15777              66.74
LINE 3 (SCARBOROUGH)         741                3.13        1543               6.53        4354               18.42
LINE 4 (SHEPPARD)            558                2.36        900                3.81        2786               11.79

Table 3: Summary statistics of assault cases by TTC subway routes. This table shows the number and proportion of assault cases that occurred within each subway line’s buffer. Each number/proportion column refers to the buffer size indicated in parentheses. Note: The numbers of assaults summed across all subway lines for a given buffer size do not necessarily add up to the totals in Table 1. This is because an assault incident may fall into more than one line’s buffer zone, in which case it is counted once for each line in this table.

Overall, regions near Line 1 have the highest number and proportion of assault incidents in the 1 km and 2 km buffers, followed by regions near Line 2; in the 5 km buffer the order reverses, with Line 2 ahead of Line 1. Regions near Line 3 and Line 4 have much lower proportions of assault incidents at every buffer size. As expected, for every subway line, the number and proportion of assault cases increase as the buffer size increases.

Assault incident density by subway routes

Subway route name            Density, 1 km buffer (per km^2)    Density, 2 km buffer (per km^2)    Density, 5 km buffer (per km^2)
LINE 1 (YONGE-UNIVERSITY)    64.30                              45.95                              29.98
LINE 2 (BLOOR - DANFORTH)    60.63                              56.55                              36.76
LINE 3 (SCARBOROUGH)         36.61                              32.19                              24.85
LINE 4 (SHEPPARD)            31.78                              20.85                              16.73

Table 4: Assault case density by subway routes. This table shows the density of assault cases within each subway line’s buffer. Each density column refers to the buffer size indicated in parentheses.

In the 1 km buffer, the Line 1 and Line 2 buffers have similar assault densities (64.30 and 60.63 per km^2). The Line 3 and Line 4 buffers also have densities similar to each other (36.61 and 31.78 per km^2), but these are only a little more than half of those observed for Line 1 and Line 2. In the 2 km and 5 km buffers, Line 1 and Line 2 again have higher densities than Line 3 and Line 4, with Line 2 now the densest; at 5 km the gaps between lines are smaller than in the 1 km buffer. Overall, for every subway line, the assault density decreases as the buffer size increases. This suggests that fewer assaults occur per unit area at locations farther from the TTC subway lines.
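The densities in Table 4 are simply the counts in Table 3 divided by the buffer areas in Table 2. The following Python sketch reproduces them from the tabulated values; the dictionary layout and names are ours, chosen for illustration.

```python
# Buffer areas in km^2 (Table 2) and assault counts (Table 3),
# each given for the 1 km, 2 km and 5 km buffers in that order.
areas_km2 = {
    "Line 1": (99.30, 189.44, 458.27),
    "Line 2": (73.43, 150.41, 429.16),
    "Line 3": (20.24, 47.93, 175.20),
    "Line 4": (17.56, 43.16, 166.54),
}
assault_counts = {
    "Line 1": (6385, 8704, 13737),
    "Line 2": (4452, 8505, 15777),
    "Line 3": (741, 1543, 4354),
    "Line 4": (558, 900, 2786),
}

def density(count, area_km2):
    """Assaults per square kilometre."""
    return count / area_km2

densities = {
    line: tuple(round(density(c, a), 2)
                for c, a in zip(assault_counts[line], areas_km2[line]))
    for line in assault_counts
}
```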

3. Methods

To study whether assault cases cluster near TTC subway routes, we conduct a point pattern analysis. We created buffers of radius 1 km, 2 km and 5 km around the TTC subway routes; the 5 km buffer covers over 90% of the data points (Table 1).

3.1 Intensity analysis

Recall that we created 3 additional binary covariates, each indicating whether an assault incident occurred within the 1 km, 2 km or 5 km TTC subway buffer. To get a rough understanding of how proximity to TTC subway routes relates to the point process, we conduct the Kolmogorov-Smirnov test, a non-parametric statistical test used to compare distributions. In our analysis, we use this test to evaluate whether the spatial distribution of assault cases is related to proximity to the TTC lines, by comparing the observed distribution of the buffer covariate at the assault locations with the distribution expected under CSR for the 1 km/ 2 km/ 5 km buffer zone. Since we have 3 “within buffer” variables, we conduct 3 tests with

  • Null Hypothesis (\(H_0\)): The spatial distribution of assault cases is independent of the buffer covariate ( 1km/ 2km/ 5km “within buffer” variable). In other words, the assault cases are distributed uniformly with respect to the covariate, and there is no relationship between the point pattern and the buffer zone.
  • Alternative Hypothesis (\(H_a\)): The spatial distribution of assault cases is influenced by the buffer covariate ( 1km/ 2km/ 5km “within buffer” variable). This implies that there is some non-random relationship, such that the density of assault cases depends on their distance from or association with the buffer zone.

If the p-values are significant, we reject \(H_0\) and conclude that assault cases are spatially dependent on the covariate, indicating potential clustering or dispersion near the buffer. This approach helps determine whether assault cases are randomly distributed or exhibit a spatial pattern related to the buffers, providing insight into potential clustering near the transit lines.
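To build intuition for the test statistic, note that for a binary covariate both cumulative distribution functions jump only at 0 and 1, so the Kolmogorov-Smirnov distance reduces to the absolute difference between the share of points inside the buffer and the share of the study area inside the buffer. The Python sketch below illustrates this special case only; it is not the routine used for the reported tests, and the function name and example inputs are hypothetical.

```python
def ks_statistic_binary(n_inside, n_total, area_inside, area_total):
    """KS distance between two distributions of a 0/1 covariate.

    Both CDFs jump only at 0 and 1, so the largest gap occurs at 0:
    |P_points(Z = 0) - P_area(Z = 0)|, which equals the gap between
    the 'inside' shares of points and of area.
    """
    share_points = n_inside / n_total
    share_area = area_inside / area_total
    return abs(share_points - share_area)

# Hypothetical example: half of the points fall in a buffer covering
# one fifth of the window, so the two distributions differ by 0.3.
d_example = ks_statistic_binary(50, 100, 20.0, 100.0)
```

Under CSR the share of points inside the buffer should be close to the share of area, so the statistic should be close to 0.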

3.2 Test for overall complete spatial randomness/clustering/regularity

We wish to test if all assault incidents are uniformly distributed across the entire Toronto region and are independent of each other, i.e. complete spatial randomness (CSR). CSR means an event (an assault incident) is equally likely to occur at any location or region within the domain (Toronto).

Quadrat counting

Firstly, we employ the technique of quadrat counting to visualize how the intensity of assault incidents varies across Toronto by creating a grid (often called quadrats) and counting the number of assault incidents in each grid cell. In general, if the point pattern follows CSR, we expect to observe random number of points across all quadrats; if the point pattern is clustered, we expect some quadrats have significantly higher number of points; if the point pattern is regular, we expect all quadrats to have similar number of points.

Additionally, we use the R function quadrat.test() to test CSR against clustering (points are concentrated in some quadrats) or regularity (points are evenly spaced across quadrats). In other words, we run quadrat.test() 3 times, each time specifying a different alternative hypothesis:

  • Test 1: \(H_0: \text{The point pattern follows CSR}\) vs \(H_1: \text{The point pattern deviates from CSR (e.g., clustering or regularity).}\)
  • Test 2: \(H_0: \text{The point pattern follows CSR or regular pattern}\) vs \(H_1: \text{The point pattern follows a clustering pattern.}\)
  • Test 3: \(H_0: \text{The point pattern follows CSR or clustering pattern}\) vs \(H_1: \text{The point pattern follows a regular pattern.}\)

Since we conduct multiple tests (3 tests), we adjust the significance threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error rate. For each test, if the p-value obtained is lower than the threshold, we reject the null hypothesis. In this study specifically, if a clustering pattern is in fact present, we expect significant p-values for Test 1 and Test 2 and a non-significant p-value for Test 3.
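The quadrat counting idea can be sketched in a few lines of Python. This is an illustration rather than a reimplementation of quadrat.test(): it computes only the grid counts and the Pearson chi-square statistic against equal expected counts, without the Monte Carlo p-value. It assumes a rectangular window with equal-area cells and points lying inside the window; all names are ours.

```python
def quadrat_counts(points, xmin, xmax, ymin, ymax, nx, ny):
    """Count points in each cell of an nx-by-ny grid over a rectangular window."""
    counts = [[0] * nx for _ in range(ny)]
    for x, y in points:
        # Clamp so that points on the upper edges fall into the last row/column.
        col = min(int((x - xmin) / (xmax - xmin) * nx), nx - 1)
        row = min(int((y - ymin) / (ymax - ymin) * ny), ny - 1)
        counts[row][col] += 1
    return counts

def chi_square_statistic(counts):
    """Pearson X^2 of the counts against equal expected counts per cell."""
    flat = [c for row in counts for c in row]
    expected = sum(flat) / len(flat)
    return sum((c - expected) ** 2 / expected for c in flat)

# Threshold after adjusting for the 3 tests described above.
adjusted_alpha = 0.05 / 3

# Toy patterns on the unit square with a 2 x 2 grid.
even_pattern = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
clustered_pattern = [(0.10, 0.10), (0.12, 0.11), (0.11, 0.13), (0.13, 0.12)]
x2_even = chi_square_statistic(quadrat_counts(even_pattern, 0, 1, 0, 1, 2, 2))
x2_clustered = chi_square_statistic(quadrat_counts(clustered_pattern, 0, 1, 0, 1, 2, 2))
```

A perfectly even pattern gives a statistic of 0, while concentrating all points in one cell gives a large value, which is the direction tested by the clustering alternative.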

Ripley’s K function

To distinguish between CSR, clustering and regularity, we additionally calculate Ripley’s K-function (with the L function adjustment) on the full dataset, without any buffer restriction.

Ripley’s K function is \(K(r) = \lambda^{-1}E(N_0(r))\), where \(N_0(r)\) is the number of events within distance \(r\) of an arbitrary event (excluding the chosen event itself) and \(\lambda\) is the average number of events per unit area. In words, \(K(r)\) is the expected number of further events within distance \(r\) of an arbitrary event, divided by the intensity. Under the null hypothesis of CSR, \(K(r) = \pi r^2\). The function therefore tests whether the observed number of points within a given distance of any point in the dataset differs significantly from what would be expected under CSR. The L function is a transformation of the K function that makes interpretation easier. Specifically, \(L(r) = \sqrt{K(r) / \pi} - r\), which makes the expected value for a random pattern equal to 0 at all distances. The L function linearizes the K function, making it easier to compare the observed pattern with a random distribution. In general, positive values of \(L(r)\) suggest clustering, while negative values suggest regularity. Note that we apply an edge correction in all Ripley’s K function estimations.

Ripley’s K function is applied to the complete dataset to test for complete spatial randomness (CSR), as ruling out CSR would establish the presence of spatial dependence and provide a basis for further clustering analysis of assault cases near the TTC lines (if the L function produces a curve well above 0).
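The definitions of \(K(r)\) and \(L(r)\) can be made concrete with a naive Python estimator. The sketch below deliberately omits the edge correction used in this study and scans all pairs, so it is only suitable for small illustrative patterns; the function names and toy data are ours.

```python
import math

def ripley_k(points, r, area):
    """Naive estimate of K(r): no edge correction, O(n^2) pair scan."""
    n = len(points)
    close_pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= r:
                close_pairs += 1  # ordered pairs (i, j) with i != j
    return area * close_pairs / (n * (n - 1))

def l_centered(points, r, area):
    """L(r) = sqrt(K(r) / pi) - r, which is 0 in expectation under CSR."""
    return math.sqrt(ripley_k(points, r, area) / math.pi) - r

# Toy patterns in a window of unit area.
tight_cluster = [(0.50, 0.50), (0.50, 0.51), (0.51, 0.50), (0.51, 0.51)]
spread_out = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
l_cluster = l_centered(tight_cluster, 0.05, 1.0)
l_spread = l_centered(spread_out, 1.0, 1.0)
```

The tightly clustered toy pattern gives a positive value of \(L(r)\) and the spread-out pattern a negative one, matching the interpretation given above.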

G-function

The G function is the cumulative distribution of the distances between nearest neighbors. The observed G function \(\hat{G}(r)\) is the proportion of observed points whose nearest neighbor is closer than \(r\). Under CSR, \(G(r) = 1- e^{-\lambda \pi r^2}\). The G test examines the distribution of nearest-neighbor distances: it tests whether the observed distances between points in a pattern differ significantly from what would be expected under CSR, by comparing the CDF of nearest-neighbor distances in the observed data with the CDF expected under CSR. If \(\hat{G}(r)\) is much greater than \(G(r)\), there is clustering, whereas if it is smaller, there is regularity. Note that we also apply an edge correction in the G function estimation.

The main difference between K function and G function is that: K function measures the number of events found up to a given distance of any particular event (i.e. uses pairwise distances) and tests on differences in terms of number of points, while G function measures the distribution of distances from an arbitrary event to its nearest event (i.e. uses nearest neighbor distances) and tests on differences in terms of distribution of nearest neighbors between points.
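A minimal Python sketch of the nearest-neighbor approach is given below. It computes the empirical \(\hat{G}(r)\) without edge correction and the theoretical CSR curve for comparison; it is meant only to illustrate the definitions above, and the names and toy pattern are ours.

```python
import math

def nearest_neighbour_distances(points):
    """Distance from each point to its nearest other point."""
    return [
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points) if j != i)
        for i, (xi, yi) in enumerate(points)
    ]

def g_hat(points, r):
    """Empirical G: share of points whose nearest neighbour is within r."""
    distances = nearest_neighbour_distances(points)
    return sum(1 for d in distances if d <= r) / len(distances)

def g_csr(r, intensity):
    """Theoretical G under CSR: 1 - exp(-lambda * pi * r^2)."""
    return 1.0 - math.exp(-intensity * math.pi * r * r)

# Toy pattern in a unit-area window: two tight pairs far apart.
pairs = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (1.01, 1.0)]
observed = g_hat(pairs, 0.02)
expected = g_csr(0.02, intensity=len(pairs) / 1.0)
```

Here every point has a neighbor within 0.02 units, so the observed value far exceeds the CSR value at that distance, which is the signature of clustering.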

3.3 Comparing Ripley’s K function obtained from buffer-specific data

Next, we subset the data points according to the 3 “within buffer” variables and obtain 3 sets of data, containing the assault cases that occurred within the 1 km, 2 km and 5 km TTC subway buffers respectively. For each subset, we re-estimate Ripley’s K function with the L function adjustment. We then conduct pairwise Kolmogorov-Smirnov tests on the K function curves obtained from the different buffer sizes to quantify the differences in clustering patterns across buffer sizes. Once again, since we are conducting multiple tests (3 pairwise tests), we adjust the p-value threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error rate. For each pair, if the p-value is lower than the threshold, we reject the null hypothesis that the two K function curves are the same; if all 3 p-values are below the threshold, we conclude that the K functions differ across all 3 buffer zones. This helps us understand whether the degree of clustering differs across buffer sizes. Note that we apply an edge correction in all Ripley’s K function estimations.
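The pairwise comparison can be sketched as follows in Python. We assume here, for illustration only, that each curve is represented by its values on a common grid of distances and that these values are treated as two samples; the functions below compute the two-sample Kolmogorov-Smirnov statistic and apply the adjusted threshold to a list of p-values. The names and example inputs are hypothetical.

```python
def ks_two_sample(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    d = 0.0
    for v in list(a) + list(b):  # the gap can only change at sample values
        fa = sum(1 for x in a if x <= v) / len(a)
        fb = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(fa - fb))
    return d

def reject_with_adjusted_threshold(p_values, alpha=0.05):
    """Compare each p-value with alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Hypothetical curve values: identical curves versus fully separated curves.
d_same = ks_two_sample([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
d_apart = ks_two_sample([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
decisions = reject_with_adjusted_threshold([0.001, 0.02, 0.5])
```

Note that a p-value of 0.02 would be significant at the unadjusted 0.05 level but not at the adjusted threshold of 0.016667.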

3.4 Intensity Estimations of buffer-specific data

We use kernel density estimation (KDE) to estimate the intensity function non-parametrically through kernel smoothing. The non-parametric form is \(\hat{\lambda}(s) = \frac{1}{h^2}\sum_i{K\left(\frac{||s-s_i||}{h}\right)/q(||s||)}\), where \(K\) is a kernel function, \(q(||s||)\) is a boundary correction, \(||s-s_i||\) is the distance between location \(s\) and observed point \(s_i\), and \(h\) is the bandwidth that controls the degree of smoothing. In this analysis, we use an isotropic Gaussian kernel. Written with the bandwidth absorbed into the kernel, it is \(K_h(s) = \frac{1}{\sqrt{2\pi h^2}}\exp\left(-\frac{s^2}{2h^2}\right)\), where \(s\) is the distance from the point where the density is being estimated. Hence, the density is estimated by \(\hat{f}(s) = \frac{1}{n}\sum_i{K_h(s-s_i)} = \frac{1}{n}\sum_i\frac{1}{\sqrt{2\pi h^2}} \exp\left(-\frac{(s-s_i)^2}{2h^2}\right)\). Each data point \(s_i\) contributes a Gaussian-shaped bump to the density estimate, centered at \(s_i\) and with spread controlled by \(h\). The estimated density \(\hat{f}(s)\) at \(s\) is the average of these contributions. Since we do not know the kernel bandwidth, we estimate an optimal value by cross-validation using bw.diggle(). This function selects the bandwidth that minimises a cross-validated estimate of the mean-square error of the kernel intensity estimator.

We estimate and plot the density on full dataset and the 3 “within buffer” data subsets to get varying density estimates within the buffers and visualize the differences across different buffer sizes.
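The kernel smoothing step can be illustrated with a small Python sketch. It sums one Gaussian bump per data point to obtain an intensity estimate at a location, using the two-dimensional normalising constant \(1/(2\pi h^2)\) and no boundary correction; both choices are simplifying assumptions of this sketch, and the bandwidth is supplied by hand rather than selected by cross-validation. The function name and toy data are ours.

```python
import math

def gaussian_intensity(s, points, h):
    """Kernel intensity estimate at location s (2D Gaussian, no edge correction)."""
    sx, sy = s
    norm = 1.0 / (2.0 * math.pi * h * h)  # 2D Gaussian normalising constant
    return sum(
        norm * math.exp(-((sx - x) ** 2 + (sy - y) ** 2) / (2.0 * h * h))
        for x, y in points
    )

# Toy pattern: three nearby points and one distant point.
toy_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
near_cluster = gaussian_intensity((0.05, 0.05), toy_points, h=0.5)
far_away = gaussian_intensity((2.5, 2.5), toy_points, h=0.5)
```

A larger bandwidth h spreads each bump more widely and produces a smoother surface, while a smaller h concentrates the estimate around the observed points.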

3.5 Point process models

We fit several Poisson process models on the full dataset using the 3 “within buffer” variables. Poisson process models assume that assault cases occur independently of one another, whereas the Ripley’s K function analysis gives potential evidence of clustering. We nevertheless fit Poisson process models because they provide a simple baseline for the distribution of assault incidents and can serve as a reference point to quantify how much the actual data deviate from randomness, even if clustering is present. In clustered data, subsets of the data might still adhere to Poisson behaviour. So, fitting the Poisson process locally, or accounting in the model for how these subsets differ (i.e. using the 3 “within buffer” variables), can still yield useful information.

The first model we fit is a homogeneous Poisson process model, which assumes that the intensity is constant over the region and uses no “within buffer” variable. Then, we fit 3 inhomogeneous Poisson process models (which still assume that points occur independently), in which the intensity is not constant over the region but varies spatially, depending only on the corresponding “within buffer” variable. In this case, the intensity function is defined as a log-linear model: \[ \lambda(x, y) = \exp(\beta_0 + \beta_1 \text{"within x km buffer"}), \] where:

  • \(\text{"within x km buffer"}\) is the binary “within x km buffer” indicator (where x = 1/2/5)
  • \(\beta_0\), \(\beta_1\) are the model coefficients to be estimated. They determine how the intensity changes with respect to the buffer covariate. Specifically, \(\beta_0\) represents the baseline log intensity outside the buffer and \(\beta_1\) represents the change in log intensity for locations inside the buffer specified by the \(\text{"within x km buffer"}\) variable.

After fitting the models, we obtain parameter estimates and their confidence intervals for each model. In particular, for the inhomogeneous models with a “within buffer” variable as covariate, we use the Z-test result to determine whether the covariate makes a significant difference to the intensity. If a “within buffer” variable is found significant, we may conclude that the intensity differs between locations inside and outside that buffer, i.e. that it depends on whether (and how near) a location is to the TTC subway lines.
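For a single binary covariate, the log-linear Poisson model has a closed-form maximum likelihood fit: the estimated intensity in each region is its count divided by its area. The Python sketch below uses this fact to compute \(\beta_0\), \(\beta_1\), a Wald standard error for \(\beta_1\) and the corresponding Z statistic. It is a sketch of the idea, not the fitting routine used in this study. The counts are those reported for the 1 km buffer in Section 2.41, but the two areas are hypothetical placeholders, so the resulting numbers are illustrative only.

```python
import math

def fit_binary_poisson(n_in, area_in, n_out, area_out):
    """Closed-form MLE of lambda = exp(b0 + b1 * inside) for a 0/1 covariate."""
    beta0 = math.log(n_out / area_out)          # log intensity outside the buffer
    beta1 = math.log(n_in / area_in) - beta0    # log ratio of inside to outside intensity
    se_beta1 = math.sqrt(1.0 / n_in + 1.0 / n_out)  # Wald standard error
    z = beta1 / se_beta1
    return beta0, beta1, se_beta1, z

# Counts from Section 2.41 (1 km buffer); areas in km^2 are HYPOTHETICAL.
beta0, beta1, se_beta1, z = fit_binary_poisson(
    n_in=10162, area_in=200.0, n_out=13477, area_out=430.0
)
intensity_inside = math.exp(beta0 + beta1)
intensity_outside = math.exp(beta0)
```

A positive \(\beta_1\) means a higher intensity inside the buffer than outside, and \(\exp(\beta_1)\) is the ratio of the two intensities.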

3.6 Clustering model

Lastly, we apply a density-based clustering algorithm, Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), using the hdbscan() function, to examine the clustering of assault incidents. HDBSCAN extends DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a clustering algorithm that groups points lying in dense parts of the study region. DBSCAN uses two main parameters:

  1. eps: The radius of the neighborhood around a point within which other points are counted.
  2. minPts: The minimum number of points required within that neighborhood for a dense region (cluster) to form.

HDBSCAN, on the other hand, does not require the \(\epsilon\) neighbourhood but still requires the minimum number of points that we wish to have in a cluster. It uses the concept of mutual reachability distance, which combines the distance between two points with how dense the neighbourhood of each point is, and builds the cluster hierarchy from the distances that connect all the points.

In this analysis, we specify the minimum number of points required to form a cluster as 100.

We run HDBSCAN on the full dataset and on the 3 “within buffer” subsets of data points. For each set of clusters, we visualize the clusters on a plot and compare the plots to see whether there are notable differences in the location, shape and size of the clusters formed, especially those formed in the TTC buffer zones. Pronounced clustering within or around the TTC subway route buffers would suggest non-random spatial dependence, implying that TTC subway routes may influence the spatial distribution of assault cases. Note that as the number of points in the data increases, we expect the number, and potentially the size, of clusters to increase. We are particularly interested in whether the clusters formed on a smaller buffer subset disappear as the data expand to the full dataset. For instance, clusters found with the 1 km buffer subset must lie in the 1 km buffer region, and we check whether these clusters disappear in the 2 km buffer data, the 5 km buffer data or the full dataset.

Ideally, if clustering does occur near TTC subway zones, we should obtain similar clusters (in terms of location) from the smaller buffer subsets (i.e. 1 km and 2 km). As the buffer size increases (i.e. to 5 km or the full dataset), these clusters should persist (perhaps with increased size), and any additional clusters occurring outside the 1 km/ 2 km buffer zones should be considerably smaller.
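The mutual reachability distance mentioned above can be illustrated with a short Python sketch. It is not the hdbscan() implementation: it only computes, for a toy set of points, the core distance of each point (taken here as the distance to its k-th nearest other point, a convention that varies between implementations) and the mutual reachability distance between two points, defined as the largest of the two core distances and the direct distance. The names and data are ours.

```python
import math

def euclidean(p, q):
    """Planar Euclidean distance between two points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def core_distance(i, points, min_pts):
    """Distance from point i to its min_pts-th nearest other point."""
    distances = sorted(
        euclidean(points[i], q) for j, q in enumerate(points) if j != i
    )
    return distances[min_pts - 1]

def mutual_reachability(i, j, points, min_pts):
    """max(core(i), core(j), d(i, j)): large if either point sits in a sparse area."""
    return max(
        core_distance(i, points, min_pts),
        core_distance(j, points, min_pts),
        euclidean(points[i], points[j]),
    )

# Toy pattern: three points close together and one isolated point.
toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
dense_pair = mutual_reachability(0, 1, toy, min_pts=2)
sparse_pair = mutual_reachability(2, 3, toy, min_pts=2)
```

Points in sparse areas are pushed farther away from everything else under this distance, which is why isolated incidents tend to be labelled as noise rather than attached to a nearby cluster.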

4. Results

4.1 Kolmogorov-Smirnov test based on buffer zone

The results of the spatial Kolmogorov-Smirnov (KS) tests indicate a significant deviation from complete spatial randomness (CSR) in the distribution of assaults in Toronto when evaluated against the buffer covariates at different spatial scales (1 km, 2 km, and 5 km). For the 1 km buffer, the test statistic \(D = 0.39218\) with a p-value < \(2.2 \times 10^{-16}\) suggests a strong departure from uniformity under CSR. Similarly, for the 2 km buffer, \(D = 0.38294\) with a p-value < \(2.2 \times 10^{-16}\) confirms this pattern. At the 5 km scale, the deviation is even more pronounced, with \(D = 0.49764\) and a p-value < \(2.2 \times 10^{-16}\). These results consistently reject the null hypothesis that the spatial distribution of assault cases is independent of the buffer covariate (the 1 km/ 2 km/ 5 km “within buffer” variable). This suggests that assault cases are spatially dependent on the covariate, indicating potential clustering or dispersion near the buffer.

4.2 Test for complete spatial randomness/clustering/regularity

4.21 Quadrat Counting

Figure 2: Quadrat Counting plot of Toronto assault cases. This Quadrat Counting plot illustrates the spatial distribution of assault cases in Toronto. Each cell in the grid represents the count of incidents within that area, revealing a clear pattern of clustering. Higher counts are concentrated in the central region, while peripheral areas show significantly lower or zero cases.

The Quadrat Counting plot (Figure 2) reveals that the distribution of assault cases is highly uneven, with distinct clusters of higher counts in the south, north-west and east regions, indicating areas of concentrated criminal activity. Note that those areas coincide with the areas covered by the Line 1 and Line 2 TTC subway routes. This pattern highlights spatial heterogeneity in assault occurrences, possibly related to underlying urban factors such as the TTC subway lines.

4.22 Quadrat Test

The results of the conditional Monte Carlo tests using quadrat counts indicate a significant deviation from complete spatial randomness (CSR) in the direction of clustering, and no evidence of a regular pattern.

Null hypothesis                                        Alternative hypothesis                            p-value    Significance
The point pattern follows CSR                          The point pattern deviates from CSR               0.001      Significant (p < 0.0167)
The point pattern follows CSR or regular pattern       The point pattern follows a clustering pattern    5e-04      Significant (p < 0.0167)
The point pattern follows CSR or clustering pattern    The point pattern follows a regular pattern       1          Not significant (p ≥ 0.0167)

Table 5: Quadrat tests setting and results. This table outlines the null hypothesis, alternative hypothesis, p-value and whether significant result is found for each test.

As mentioned earlier, since we are conducting multiple tests (3 tests) here, we adjusted the p-value threshold to \(0.05/3 = 0.016667\) to prevent an inflated Type-I error rate.

For the first test, with the two-sided alternative hypothesis, the p-value of 0.001 leads us to reject the null hypothesis of CSR, suggesting non-randomness in the distribution. The second test, with the clustering alternative hypothesis, resulted in a p-value of 0.0005, strongly supporting the presence of a clustered spatial pattern. Conversely, the third test, with the regularity alternative hypothesis, resulted in a p-value of 1, so we cannot reject the null hypothesis that the point pattern follows CSR or a clustering pattern, and there is no evidence of a regular spatial pattern. These results indicate that the observed data exhibit significant clustering rather than randomness or regularity.

4.23 Ripley’s K-function (with L function adjustment)

Figure 3: L-function plot for Toronto assault cases, with CSR envelopes. The black line indicates the observed \(K(r)\) curve and the red dashed line with grey evelopes is the expected function under CSR. Note that the observed \(K(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within gray envelopes), indicating significant clustering at multiple spatial scales.

The L-function plot (Figure 3) demonstrates strong evidence of spatial clustering in Toronto assault cases. The observed \(K(r)\) values lie well above the CSR envelopes at all distances \(r\) (in meters) considered, showing that the observed distribution deviates significantly from complete spatial randomness. This pattern suggests that assaults are not uniformly distributed but instead tend to occur in clusters. This finding is consistent with the quadrat count test results in section 4.22.
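To make the comparison with CSR concrete, the sketch below estimates Ripley's K-function and its L transformation, \(L(r) = \sqrt{K(r)/\pi}\), for which \(L(r) = r\) under CSR. It is a simplified illustration under stated assumptions: the estimator has no edge correction, the window is the unit square, and the points are synthetic rather than the assault data.

```python
import math
import random

def ripley_k(points, r, area=1.0):
    """Naive K(r) estimate (no edge correction): scaled count of close pairs."""
    n = len(points)
    close_pairs = 0
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            if math.hypot(xi - xj, yi - yj) <= r:
                close_pairs += 2  # count both ordered pairs (i, j) and (j, i)
    return area * close_pairs / (n * (n - 1))

def l_function(points, r, area=1.0):
    """L transformation of K; equals r in expectation under CSR."""
    return math.sqrt(ripley_k(points, r, area) / math.pi)

rng = random.Random(7)
# Hypothetical patterns: one clustered, one completely random.
clustered = [(rng.gauss(0.3, 0.05), rng.gauss(0.3, 0.05)) for _ in range(150)]
clustered += [(rng.random(), rng.random()) for _ in range(50)]
uniform = [(rng.random(), rng.random()) for _ in range(200)]

for r in (0.02, 0.05, 0.10):
    # Positive L(r) - r suggests clustering at distance r.
    print(r, l_function(clustered, r) - r, l_function(uniform, r) - r)
```

In practice an edge-corrected estimator and simulation envelopes, as in Figure 3, are needed before reading \(L(r) - r > 0\) as significant clustering.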

4.24 G-function

Figure 4: G-function plot for Toronto assault cases, with CSR envelopes. The black line indicates the observed \(G(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(G(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.

The observed G-function (Figure 4) is substantially greater than the theoretical \(G(r)\) under CSR and generally remains outside the CSR envelopes, indicating that the spatial pattern of assaults deviates significantly from randomness at the analyzed distances. Because nearest-neighbour distances are shorter than expected under CSR, this is strong evidence of clustering in the distribution of assault cases. This finding is consistent with the results of the quadrat test (section 4.22) and the K-function analysis (section 4.23).
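The G-function is the cumulative distribution of nearest-neighbour distances, and under CSR with intensity \(\lambda\) it has the closed form \(G(r) = 1 - e^{-\lambda \pi r^2}\). The sketch below compares an uncorrected empirical estimate with that formula on synthetic points in the unit square. The data, the distance 0.01 and the lack of edge correction are illustrative assumptions.

```python
import math
import random

def nn_distances(points):
    """Distance from each point to its nearest other point."""
    out = []
    for i, (xi, yi) in enumerate(points):
        out.append(min(math.hypot(xi - xj, yi - yj)
                       for j, (xj, yj) in enumerate(points) if j != i))
    return out

def g_empirical(points, r):
    """Share of points whose nearest neighbour lies within distance r."""
    d = nn_distances(points)
    return sum(x <= r for x in d) / len(d)

def g_csr(r, intensity):
    """Theoretical G(r) for a homogeneous Poisson process."""
    return 1.0 - math.exp(-intensity * math.pi * r * r)

rng = random.Random(11)
# Hypothetical clustered pattern: 200 points, so intensity is 200 per unit area.
pattern = [(rng.gauss(0.3, 0.05), rng.gauss(0.3, 0.05)) for _ in range(150)]
pattern += [(rng.random(), rng.random()) for _ in range(50)]
intensity = len(pattern) / 1.0

r = 0.01
print(g_empirical(pattern, r), g_csr(r, intensity))
```

An empirical value above the CSR value at the same distance means neighbours are closer than a random pattern would produce, which is the reading applied to Figure 4.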

4.3 Comparing Buffer-specific Ripley’s K-function

Figure 5: L-function plot for Toronto assault cases that occurred within the 1km subway buffer, with CSR envelopes. The black line indicates the observed \(K(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(K(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.

Figure 6: L-function plot for Toronto assault cases that occurred within the 2km subway buffer, with CSR envelopes. The black line indicates the observed \(K(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(K(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.

Figure 7: L-function plot for Toronto assault cases that occurred within the 5km subway buffer, with CSR envelopes. The black line indicates the observed \(K(r)\) curve and the red dashed line with grey envelopes is the expected function under CSR. Note that the observed \(K(r)\) curve (black line) consistently exceeds the CSR expectation (red dashed line within grey envelopes), indicating significant clustering at multiple spatial scales.

According to Figures 5-7, the L-function plots demonstrate strong evidence of spatial clustering in Toronto assault cases, regardless of which buffer-specific data subset is used. Across all buffer sizes, the observed \(K(r)\) values lie well above the expected \(K(r)\) under complete spatial randomness (grey shaded region), indicating clustering rather than randomness. The clustering effect is most pronounced in the 1 km buffer, where \(K(r)\) rises rapidly and peaks earlier than in the other buffers. The 2 km buffer plot shows a similar pattern but with slightly reduced clustering intensity and a broader peak. The 5 km buffer exhibits the least steep increase, indicating that a comparable degree of clustering is reached only at larger distances. These differences suggest that the observed clustering becomes less concentrated as the buffer size increases, which might reflect varying scales of spatial dependence or density across the study area.

We use Kolmogorov-Smirnov tests to compare the Ripley’s K-function curves obtained from the different buffer subsets and thereby quantify the differences in clustering patterns across buffer sizes. As before, since we are conducting three tests, we adjust the p-value threshold to \(0.05/3 \approx 0.0167\) to prevent an inflated Type-I error rate. The results are summarized in the table below:

| Buffer Comparison | Test Statistic (D) | p-value | Significance |
| --- | --- | --- | --- |
| 1 km vs. 2 km | 0.68031 | \(< 2.2 \times 10^{-16}\) | Highly Significant |
| 2 km vs. 5 km | 0.28265 | \(< 2.2 \times 10^{-16}\) | Highly Significant |
| 1 km vs. 5 km | 0.53996 | \(< 2.2 \times 10^{-16}\) | Highly Significant |

Table 6: Kolmogorov-Smirnov test results. This table outlines the pairwise comparisons of function curves, the test statistic, the p-value and whether a significant result is found for each test.

The results confirm significant differences in clustering patterns between all buffer sizes, with the largest difference observed between the 1 km and 2 km buffers (D = 0.68), followed by the 1 km and 5 km buffers (D = 0.54). These findings suggest that the spatial clustering patterns may vary substantially depending on the buffer size, which has implications for the HDBSCAN clustering model. Specifically, the choice of buffer-size data subset may strongly influence the resulting cluster structure and the scale at which clusters are identified.
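The test statistic D in Table 6 is the largest vertical gap between the two empirical distribution functions being compared. The sketch below computes it for two samples. It is an illustration only: treating the K-function values on a grid of distances as the two samples is an assumption about how the comparison was set up, and the two curves used here are hypothetical.

```python
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both ECDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Hypothetical K-function curves evaluated on the same grid of distances.
radii = [10.0 * k for k in range(1, 51)]
curve_a = [3.0 * math.pi * r * r for r in radii]  # stronger clustering
curve_b = [1.5 * math.pi * r * r for r in radii]  # weaker clustering

alpha = 0.05 / 3  # Bonferroni-adjusted threshold for three comparisons
d_ab = ks_statistic(curve_a, curve_b)
print(d_ab)
```

D ranges from 0 (identical distributions) to 1 (no overlap), so the value of 0.68 for the 1 km vs. 2 km comparison indicates a large separation between the two sets of curve values.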

4.4 Kernel density estimation (KDE)

Figure 8: Estimated density: 1km buffer Toronto assault cases. This kernel density estimation plot shows the overall spatial distribution of assault cases that occurred within the 1km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.

Figure 9: Estimated density: 2km buffer Toronto assault cases. This kernel density estimation plot shows the overall spatial distribution of assault cases that occurred within the 2km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.

Figure 10: Estimated density: 5km buffer Toronto assault cases. This kernel density estimation plot shows the overall spatial distribution of assault cases that occurred within the 5km buffer of TTC subway lines. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.

Figure 11: Estimated density: all Toronto assault cases. This kernel density estimation plot shows the overall spatial distribution of all assault cases across Toronto. Darker blue areas represent lower densities, while warmer colors (pink, yellow) signify higher densities of assault cases.

Optimal bandwidth:

  • 1km buffer data: 3.19m, indicating spatial variation at a scale of 3.19m
  • 2km buffer data: 2.91m, indicating spatial variation at a scale of 2.91m
  • 5km buffer data: 2.81m, indicating spatial variation at a scale of 2.81m
  • Full data: 2.93m, indicating spatial variation at a scale of 2.93m
  • The 5km buffer data has the lowest optimal bandwidth, indicating the most localized estimated pattern.

Density estimates:

  • 1km buffer data: the mean density is \(2.983404 \times 10^{-5}\) with range \([-1.160475 \times 10^{-18},\ 6.592689 \times 10^{-3}]\)
  • 2km buffer data: the mean density is \(3.592894 \times 10^{-5}\) with range \([-7.897008 \times 10^{-19},\ 5.970451 \times 10^{-3}]\)
  • 5km buffer data: the mean density is \(3.855214 \times 10^{-5}\) with range \([-1.526693 \times 10^{-18},\ 7.114782 \times 10^{-3}]\)
  • Full data: the mean density is \(3.552336 \times 10^{-5}\) with range \([-8.361371 \times 10^{-19},\ 5.007728 \times 10^{-3}]\)
  • The 5km buffer data has the highest mean and the widest range of density estimates among the buffer subsets and the full data.
  • The lower bounds are negative only at the order of \(10^{-18}\), so they are effectively zero.

Density plots:

Overall, there are no large differences in the density estimates (Figures 8-11) across the four datasets. In general, the higher-density areas are concentrated in downtown Toronto. With the smaller buffer subsets (Figures 8-9), slightly more downtown locations show high density estimates. Meanwhile, with the 5km buffer subset and the full dataset, we observe another high-density location (shown as a yellow spot) in the south-west corner of Toronto (Figures 10-11).
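For reference, a kernel density estimate at a location is the sum of kernel contributions from all observed points, scaled by the bandwidth. The sketch below evaluates an isotropic Gaussian kernel estimate by direct summation on synthetic points. The kernel choice, the bandwidth of 0.05 and the data are illustrative assumptions and do not reproduce the estimates in Figures 8-11.

```python
import math
import random

def gaussian_kde_at(x, y, points, bandwidth):
    """Intensity estimate (points per unit area) at (x, y), Gaussian kernel."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += norm * math.exp(-d2 / (2.0 * bandwidth ** 2))
    return total

rng = random.Random(3)
# Hypothetical pattern with a dense area around (0.3, 0.3).
pattern = [(rng.gauss(0.3, 0.05), rng.gauss(0.3, 0.05)) for _ in range(150)]
pattern += [(rng.random(), rng.random()) for _ in range(50)]

bandwidth = 0.05  # assumed value, in the same units as the coordinates
grid = [(i / 20.0, j / 20.0) for i in range(21) for j in range(21)]
values = [gaussian_kde_at(gx, gy, pattern, bandwidth) for gx, gy in grid]

print(min(values), sum(values) / len(values), max(values))
```

A direct sum of Gaussian kernels cannot be negative, so tiny negative minima of the kind reported above most likely come from floating-point error in a fast evaluation method rather than from the estimator itself.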

4.5 Point process model

We fit several Poisson process models to the full dataset. Below is a summary of the model results:

| Model | Parameter term | Coefficient estimate | SE of estimate | Confidence interval of estimate | AIC |
| --- | --- | --- | --- | --- | --- |
| Homogeneous model (constant intensity) | Intercept | −10.24358 | 0.006504074 | [−10.25632, −10.23083] | 531575.7 |
| Inhomogeneous model with “within 1km buffer” variable | Intercept | −10.5535641 | 0.008616209 | [−10.5704515, −10.53668] | 526388.6 |
| | within 1km buffer | 0.9757522 | 0.013136862 | [0.9500044, 1.00150] | |
| Inhomogeneous model with “within 2km buffer” variable | Intercept | −10.6621427 | 0.01053683 | [−10.6827945, −10.6414909] | 527691.3 |
| | within 2km buffer | 0.8090311 | 0.01339284 | [0.7827816, 0.8352806] | |
| Inhomogeneous model with “within 5km buffer” variable | Intercept | −11.0634765 | 0.02215667 | [−11.1069028, −11.0200502] | 529296.1 |
| | within 5km buffer | 0.9500268 | 0.02317779 | [0.9045992, 0.9954544] | |

Table 7: Fitted Point Process models results. This table outlines the model description, coefficient estimates, standard error and confidence interval for each parameter, as well as the AIC value for each model.

We interpret the intercept estimates as the baseline log intensity (outside the buffer, for the inhomogeneous models) and the buffer coefficient estimates as the change in log intensity for locations within the corresponding buffer. Additionally, all intercept and buffer coefficient estimates are statistically significant (by Z-test), implying that the baseline log intensity is non-zero and that the buffer variables are associated with significant differences in intensity.

The results indicate that proximity to TTC subway routes is significantly associated with the intensity of assault cases in Toronto. The inhomogeneous model with the “within 1km buffer” variable provides the best fit to the data, with the lowest AIC value (526388.6). Its coefficient of 0.976 on the log scale corresponds to an estimated assault intensity about 2.7 times higher (\(e^{0.976} \approx 2.65\)) inside the 1km buffer than outside it. The estimated ratios for the larger buffers are smaller, about 2.2 within 2km (\(e^{0.809} \approx 2.25\)) and 2.6 within 5km (\(e^{0.950} \approx 2.59\)), so the effect does not decrease monotonically with buffer size, and these models fit less well (higher AIC). These findings highlight the spatial variability of assault density and suggest that targeted safety interventions should focus on areas close to transit infrastructure, particularly within the 1km radius, to maximize their impact on public safety.
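The intensity ratios and the AIC comparison can be reproduced from the values in Table 7 with a few lines of arithmetic. The sketch below uses only the reported coefficients, standard errors and AIC values. The Wald interval with the normal quantile 1.959964 is an assumption, although it matches the reported interval for the 1km coefficient.

```python
import math

# Buffer coefficients, standard errors and AIC values copied from Table 7.
models = {
    "homogeneous": {"aic": 531575.7},
    "1km": {"coef": 0.9757522, "se": 0.013136862, "aic": 526388.6},
    "2km": {"coef": 0.8090311, "se": 0.01339284, "aic": 527691.3},
    "5km": {"coef": 0.9500268, "se": 0.02317779, "aic": 529296.1},
}

Z = 1.959964  # assumed 95% normal quantile for Wald intervals

ratios = {}
for name, m in models.items():
    if "coef" not in m:
        continue
    # exp(coefficient) is the ratio of intensity inside vs. outside the buffer.
    ratio = math.exp(m["coef"])
    low = math.exp(m["coef"] - Z * m["se"])
    high = math.exp(m["coef"] + Z * m["se"])
    ratios[name] = (ratio, low, high)

best = min(models, key=lambda name: models[name]["aic"])
delta_aic = {name: m["aic"] - models[best]["aic"] for name, m in models.items()}

for name, (ratio, low, high) in ratios.items():
    print(name, round(ratio, 2), round(low, 2), round(high, 2))
print(best, delta_aic)
```

Reporting the ratio together with its interval on the ratio scale makes it easier to see that the 1km and 5km estimates are close, while the 2km estimate is clearly lower.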

4.6 HDBSCAN

Figure 12: Clustering of assault incidents within 1 km of TTC subway routes. Each cluster consists of at least 100 data points. 15 clusters are formed.

Figure 13: Clustering of assault incidents within 2 km of TTC subway routes. Each cluster consists of at least 100 data points. 18 clusters are formed.

Figure 14: Clustering of assault incidents within 5 km of TTC subway routes. Each cluster consists of at least 100 data points. 30 clusters are formed.

Figure 15: Clustering of all assault incidents in Toronto. Each cluster consists of at least 100 data points. 33 clusters are formed.

Figures 12-14 show the clustering analysis of assault incidents occurring within 1 km, 2 km and 5 km of TTC subway routes in Toronto, while Figure 15 shows that of all Toronto assault incidents. Each cluster consists of at least 100 data points. Each colored cluster indicates a distinct geographical grouping of incidents, identified using the spatial clustering method HDBSCAN. The clusters highlight areas with higher concentrations of assaults, suggesting potential hotspots near subway routes. The black points scattered across the map denote individual incidents that are not assigned to any cluster and are treated as noise points, while the color-coded regions provide insights into patterns of spatial proximity and density.

The 1km and 2km buffer datasets (Figures 12-13) generate clusters (15 and 18 respectively) at similar locations, with the clusters in the 2km buffer slightly larger in size. As we increase the buffer size to 5km and finally to the full dataset, the clusters near TTC subway routes identified with the 1km and 2km buffer data are generally retained and grow slightly (which is expected given that more points are available in larger datasets). Note that a non-negligible number of additional clusters are observed in the 5km buffer data or full data but not in the 1km or 2km buffer data (almost twice as many clusters are found with the larger datasets). However, they are generally smaller than the main clusters near TTC subway routes. There are also two notably large additional clusters in the south-west corner of Toronto.

Overall, the significant clustering patterns found within or around the 1km and 2km TTC subway route buffers are retained as we expand the study area to the entire Toronto region, suggesting that TTC subway routes may influence the spatial distribution of assault cases.
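HDBSCAN separates dense groups from noise by working with mutual reachability distances, which stretch the distance between two points whenever either of them lies in a sparse area. The sketch below computes that quantity for a toy dataset. It shows only this one building block, not the full algorithm: the neighbour count k, the convention of excluding the point itself from its neighbours, and the coordinates are all illustrative assumptions.

```python
import math

def core_distance(points, idx, k):
    """Distance from points[idx] to its k-th nearest other point."""
    xi, yi = points[idx]
    dists = sorted(math.hypot(xi - xj, yi - yj)
                   for j, (xj, yj) in enumerate(points) if j != idx)
    return dists[k - 1]

def mutual_reachability(points, a, b, k):
    """max(core distance of a, core distance of b, distance between a and b)."""
    xa, ya = points[a]
    xb, yb = points[b]
    direct = math.hypot(xa - xb, ya - yb)
    return max(core_distance(points, a, k), core_distance(points, b, k), direct)

# Toy data: four points in a tight group and one isolated point.
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
k = 2

within_group = mutual_reachability(toy, 0, 1, k)
to_isolated = mutual_reachability(toy, 0, 4, k)
print(within_group, to_isolated)
```

Points in the tight group stay close under this distance, while the isolated point is pushed far from everything. This is the mechanism by which sparse incidents end up labelled as noise, and the minimum of 100 points per cluster noted in the figure captions then decides which dense groups are reported as clusters.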

5. Conclusion and Discussion

5.1 Key findings

The analysis reveals significant spatial clustering of assault cases in Toronto, particularly influenced by proximity to TTC subway routes. Firstly, Kolmogorov-Smirnov tests conclude that the intensity of assault cases depends on whether they occurred within the 1 km, 2 km, and 5 km buffer zones. Quadrat, Ripley’s K-function and G-function tests further reject the null hypothesis of complete spatial randomness (CSR) and confirm the presence of clustering in the assault incident point pattern. Comparison of Ripley’s K-function across the within-buffer subsets reveals that clustering patterns depend on the buffer size as well (i.e. proximity to TTC subway routes). Secondly, kernel density estimation shows that the downtown Toronto region is consistently estimated to have a high density of assault incidents across all within-buffer subsets of data points. Thirdly, Poisson process models incorporating the buffer covariates outperform the homogeneous model, indicating that spatial variation in assault incidents correlates with proximity to subway routes. Lastly, the HDBSCAN clustering analysis revealed significant concentrations of assault incidents near TTC subway routes in Toronto, with distinct clusters identified within the 1 km and 2 km buffers. These clusters retained their general location and slightly increased in size as the buffer was expanded to 5 km and the entire city, indicating that the spatial influence of subway routes persists across broader study areas. Larger datasets also revealed additional clusters, particularly in the southwest of Toronto, though these were mostly smaller in size than the main clusters near subway routes.

In conclusion, the findings of this analysis indicate that assault incidents in Toronto exhibit a clustering point pattern rather than being randomly distributed across the city. This clustering suggests that certain areas experience disproportionately high levels of such incidents, creating identifiable “hotspots” that could warrant further investigation. The spatial relationship between these clusters and the Toronto Transit Commission (TTC) subway routes suggests that the transit system may play a significant role in influencing the distribution of assault incidents. This influence could stem from several factors, including high population density and increased foot traffic around subway stations, which often serve as hubs of activity. Additionally, the movement of individuals along these transit routes could create opportunities for encounters, both positive and negative, within these high-traffic zones.

Identifying and understanding these hotspots is crucial for developing targeted interventions aimed at reducing assault incidents. Such interventions might include enhancing security measures at or near subway stations, increasing surveillance or police presence, improving lighting and visibility in these areas, or fostering community engagement to address underlying social issues. Overall, this research underscores the importance of integrating spatial analysis into urban planning and public safety strategies. By recognizing and addressing the spatial dynamics of assault incidents, stakeholders can implement more effective and localized solutions to enhance safety and security for all residents and commuters in Toronto.

5.2 Limitations

While this study provides valuable insights into the spatial clustering of assault incidents near TTC subway routes, several limitations must be acknowledged. First, the analysis is constrained to reported assault incidents, which may underrepresent the true number of assaults due to underreporting or data recording discrepancies. Second, the study assumes that proximity to subway routes directly correlates with transit-related activity, but it does not account for other factors that may influence assault occurrences, such as socio-economic conditions, land use, or time-of-day variations. Third, the static nature of the buffer zones does not consider dynamic population movements or fluctuations in transit ridership, which may affect assault patterns. Additionally, the reliance on a single year of data limits the ability to identify long-term trends or temporal variations. Finally, spatial modeling techniques may be influenced by assumptions about the underlying distribution of assault incidents, potentially oversimplifying complex interactions. These limitations highlight the need for further research incorporating additional data sources, temporal analyses, and contextual variables to provide a more comprehensive understanding of crime patterns near urban transit systems.